arXiv:1506.08350v2 [cs.LG] 12 Jan 2016 


1 


Stochastic Gradient Made Stable: 

A Manifold Propagation Approach for 
Large-Scale Optimization 

Yadong Mu, Member, IEEE, Wei Liu, Member, IEEE, and Wei Fan, Member, IEEE 


Abstract —Stochastic gradient descent (SGD) hoids as a ciassicai method to buiid iarge scaie machine iearning modeis over big data. 
A stochastic gradient is typicaiiy caicuiated from a iimited number of sampies (known as mini-batch), which potentiaiiy incurs a high 
variance and causes the estimated parameters bounce around the optimai soiution. To improve the stabiiity of stochastic gradient, 
recent years have witnessed the proposai of severai semi-stochastic gradient descent aigorithms, which distinguish themseives from 
standard SGD by incorporating giobai information into gradient computation. In this paper we contribute a novei stratified 
semi-stochastic gradient descent (S3GD) aigorithm to this nascent research area, acceierating the optimization of a iarge famiiy of 
composite convex functions. Though theoreticaiiy converging faster, prior semi-stochastic aigorithms are found to suffer from high 
iteration compiexity, which makes them even siower than SGD in practice on many datasets. In our proposed S3GD, the 
semi-stochastic gradient is caicuiated based on efficient manifoid propagation, which can be numericaiiy accompiished by sparse 
matrix muitipiications. This way S3GD is abie to generate a highiy-accurate estimate of the exact gradient from each mini-batch with 
iargeiy-reduced computationai compiexity. Theoretic anaiysis reveais that the proposed S3GD eiegantiy baiances the geometric 
algorithmic convergence rate against the space and time compiexities during the optimization. The efficacy of S3GD is aiso 
experimentaiiy corroborated on severai iarge-scaie benchmark datasets. 

Index Terms —Large-scaie optimization, semi-stochastic gradient descent, manifoid propagation. 
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1 Introduction 

Regularized risk minimization | [l^ | is a fundamental subject 
in machine learning and statistics, whose formulations typ¬ 
ically admit a combination of a loss function and a reg¬ 
ularization term. This paper addresses a general class of 
convex regularized risk minimization problems which can 
be expressed as a composition: 

w* = argmin {F{w) := P(w^x) -|- i?(w)}, (1) 

W 

in which w, x denote the parameter vector and data vector 
respectively. Both P(w^x) and R{w) are assumed to be 
convex functions. Moreover, let P(w^x) be a weighted 
addition of many atomic loss functions, each of which is 
differentiable. We simply define each atomic function on an 
input data pair (x^, j/i), where x^ S represents a feature 
vector and yi denotes its associated label. Popular choices 
of the loss functions include the square loss (w^x^ — 
the logistic loss log(l -I- exp(—y^w^Xi)), and the hinge 
loss (1 — In the above cases yi G {±1}, yet 

in others yi can be real-valued in regression problems or 
missing in an unsupervised learning setting. i?(w) defines a 
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proper regularization function. It imposes some structural 
preference on the parameters (e.g., structural sparsity or 
matrix low-rankness). i?(w) can be non-smooth with respect 
to w, such as the sparsity-encouraging 1-norm |jw|ji. 

When facing a large volume of training data, the space 
and time complexities become critical limiting factors in 
building a machine learning model. In such scenarios, 
stochastic (sub)gradient descent (SGD) 0, 0, §, 0, 0, 
(Tt) , pi) is a favored method used by many theorist and 
practitioners. The most attractive trait of SGD is the light¬ 
weight computation at each iteration of update. Its single¬ 
sample or mini-batch 0, p7) updating scheme is a general 
remedy for the Oin) complexity in exact gradient descent 
(GD) methods (n represents the number of training sam¬ 
ples). Therefore, SGD algorithms are particularly promising 
when there is a limited budget of resources. Given properly- 
specified step size parameters at each iteration, SGD algo¬ 
rithms often enjoy provably fast rates of convergence. 

The major downside of SGD in practical implementa¬ 
tions is caused by large variance of stochastic gradients. 
Statistically, the mathematical expectation of stochastic gra¬ 
dients is exactly the full gradient. TTowever, the randomness 
in constructing mini-batch brings large variance to stochas¬ 
tic gradients, particularly for complex data set. Moving 
along the direction of a stochastic gradient does not always 
guarantee a decrease of the entire training loss. Under large 
stochastic gradient variance, the estimated parameters often 
drastically bounce around the global optimal solution. 

Recent years have witnessed the emerging efforts of de¬ 
veloping sophisticated algorithms which reduce the stochas¬ 
tic gradient variance in SGD. The shared idea underly- 
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Fig. 1. Illustration of residual-minimizing gradient correction. 
Stochastic gradient calculated from a single random sample 
often significantly deviates from the exact gradient. A simple so¬ 
lution is to compensate the stochastic gradient with the residual 
between the neisy stochastic gradient and full gradient (pletted 
as red dotted arrew in this figure). Exact residual is computation¬ 
ally expensive. Instead, semi-stochastic gradient descent ap¬ 
proximately estimates the residue (pletted as blue dotted arrow 
in this figure), and amends the stochastic gradient accordingiy. 
Best viewing in ceior mede. 

ing these works is incorporating an additional gradient- 
correcting operation when computing the stochastic gra¬ 
dient. The corrected stochastic gradient becomes a more 
accurate approximation of the full gradient. Statistically, it 
enjoys a reduced level of variance. For example, the work 
in p8) explicitly expresses the stochastic gradient variance 
and proves that constructing mini-batch using special non- 
uniform sampling strategy is able to reduce the stochastic 
gradient variance. The sampling probability is essentially 
based on the contextual importance of a sample. Another 
method named stochastic average gradient (SAG) [ p^ | keeps 
a record of historic stochastic gradients and adaptively 
averages them for the use in the current iteration. The rate 
of convergence is thereby improved to 0{l/k) for general 
convex functions, and 0{p^) withp < 1 for strongly convex 
functions, respectively (fc is the count of iterations). For 
atomic functions in special forms {e.g., linear function of the 
data vectors as in linear regression and logistic regression), 
the storage of historic gradients in SAG can be reduced from 
0{nd) to 0{n) (n, d represent the sample count and feature 
dimension respectively). Flowever, generally storing historic 
gradients in SAG entails a heavy burden for machine learn¬ 
ing models with many parameters. 

This paper advocates an efficient manifold propagation 
approach for reducing the stochastic gradient variance in 
large-scale machine learning. It aims to improve the stability 
of the stochastic gradient, such that large descending step 
sizes can be used for faster convergence. We adopt the 
computational framework of residual-minimizing gradient cor¬ 
rection which was originally proposed in stochastic variance- 
reduced gradient (SVRG) 0 by Johnson and Zhang. The 
computational framework is comprised of two steps: 1) 
estimate the residual between a stochastic gradient and the 
full gradient using global information, and 2) compensate 
the stochastic gradient such that the residual is largely 
minimized. 

Since the optimization proceeds in rounds, we can thus 
describe it with an update rule. For simplicity, consider the 
case that each mini-batch only contains a single random 


sample. Assume is the latest estimation for the problem 
min^v F{w) at the fc-th iteration, standard SGD and full 
(sub)gradient descend (GD) will seek for a new estimation 
according tc0 

(SGD) : - r;fcVF,(w'=), (2) 

(GD) : - r;fcVF(w'=), (3) 

where rjk is a delicately-chosen step size. The term Fiiyr^) in 
SGD denotes the atomic function conditioned on a random 
sample and the latest parameters w^. F{'w^ ) is computed 
using all training set. 

In contrast, semi-stochastic gradient is obtained by the 
rule below: 

(Vi^,(w'=) - (VF,(w) - VF(w))), (4) 

"-V-" 

semi— stochastic gradient 

where w represents some historic memory of recent parame¬ 
ter estimation, w is supposed to be proximal to w^. The term 
VFi(w) — VF"(w) approximately estimates the residual be¬ 
tween the stochastic gradient of sample and full gradient. 
By subtracting the residual term from VFi(w^), it naturally 
aligns the stochastic gradient with the full gradient. As an 
extreme case, letting w = will immediately get the full 
gradient in Q. The idea is intuitively explained in Figure 

Theoretic analysis in pj, | [To) , reveals that semi¬ 
stochastic algorithms achieve a geometric rate of conver¬ 
gence. Though such a convergence rate is generally re¬ 
garded as the synonym of satisfactory efficiency, it is im¬ 
portant to emphasize that this rate is achieved at the cost 
of higher iteration complexity compared to standard SGD. 
In our experiments, we are surprised to find that SGD still 
dominates in many cases, since its light-weight iteration cost 
compensates its slow theoretic convergence rate. In other 
words, the promising geometric convergence rate of existing 
semi-stochastic algorithms is probably Pyrrhic victories at 
excessive costs of maintaining high-accuracy estimation of 
gradient residual. 

We find that a comprehensive quantitative comparison 
between semi-stochastic algorithms and SGD is still missing 
in the literature. In fact, most existing semi-stochastic algo¬ 
rithms either rely on periodic full gradient computation 0 
or use Hessian-like covariance matrix operations |[20|, which 
account for their high iteration complexities. In this paper 
we expose a novel way of efficiently computing semi¬ 
stochastic gradient and evaluate it on a variety of massive 
data sets. We term the new method as stratified semi-stochastic 
gradient descent (S3GD) hereafter. Our major contributions 
are described below: 

• As a crucial component of the proposed S3GD, we 
devise an efficient manifold propagation approach 
for computing semi-stochastic gradient. First, a fixed 
number of anchors are drawn in a stratified man¬ 
ner. After that, each sample in the training set is 
connected to its adjacent anchors, forming a graph- 
defined manifold. At each iteration, the gradient 
information computed on the anchors diffuses over 

1. When FJw) is non-smooth, sub-differential (rather than 
gradient) will be used. However, we here abuse the notation 
V for statement conciseness. 
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the manifold, obtaining an approximate estimation 
of the full gradient. The idea empirically proves to 
be a strong competitor to the existing expensive, 
albeit accurate, gradient-correcting operations such 
as SVRG. 

• We provide theoretic analysis about S3GD. Under 
standard assumptions imposed on the objective func¬ 
tions {i.e., strong convexity and Lipschitz continuity) 
and with a constant step size, the objective value 
obtained by S3GD converges to F(w*) -f A at a 
geometric rate, where F(w*) is the global minimum 
of F{w) and A is some quantity determined by the 
quality of anchor-based function approximation. 

• Last but not least, we conduct quantitative investi¬ 
gation over 9 different benchmarks, covering a large 
spectrum of real-world problems. The experimental 
evaluations fully validate the efficiency and effective¬ 
ness that S3GD brings. Moreover, the comparisons 
between various semi-stochastic algorithms and clas¬ 
sic SGD is so far the most comprehensive and sup¬ 
posed to be very useful for re-calibrate the research 
direction of semi-stochastic algorithms. 


The remainder of this paper is organized as follows: 
We start in Section]^ by describing preliminary knowledge 


and algorithmic details of S3GD. Specifically, Section 2.4 is 
devoted to applying the generic idea of S3GD to several 
representative machine learning problems. We then give 
the theoretic analysis in Section |3) where the major obser¬ 
vation is found in Theorem |3.1[ In Section we present 
the quantitative investigation of S3GD on several large- 
scale benchmark datasets widely used in machine learning 
and statistics. Finally, in Section]^ we draw the concluding 
remarks and discuss the future perspective. 


2 The Proposed Algorithm 
2.1 Notations and Assumptions 

Notations: We will denote vectors and matrices by bold-face 
letters. Let ||x|| 2 , ||x||i be the Euclidean norm and 1-norm 
(summation of all absolute elements) of a vector respec¬ 
tively. (a;)+ = max(a::,0) is the zero-thresholding operation. 
Denote the training data set as X = {{xi,yi)}, where i = 
1,..., n. Each sample is described by a tuple (x^, yi), where 
Xi S is the feature vector and yi corresponds to either la¬ 
bels in supervised learning or response values in regression 
problems. The smooth part in Problem Q is premised in 
an additive form, namely P(w) = (1/n) JJiLi yi, wM 
The regularization term R{w) is convex yet not mandatorily 
differentiable. Whenever not incurring confusion, we use 
the notation ijjii'^) for simplifying ijj{x.i, ?/i, w). Throughout 
this paper, by default we use ||x|| to represent the Euclidean 
norm unless otherwise clarified. 

Our theoretic observations are based on the following 
assumptions, similar to previous semi-stochastic gradient 
descent methods fz[\ , | |27| : 

2. P(w^x) and P(w) will be interchangeably used in this 
paper. P(w^x) will be used when we highlight the interplay 
between w and x. Likewise and -i/ijw^Xi) are also 

equivalently used. 


Algorithm 1 The S3GD Algorithm 

1: Parameters: maximal number of inner iterations ki„, the 
number of samples in a mini-batch p and the step-size 
parameter y, 

2: Output: optimal parameter vector w*; 

3: Initialize w = 0; 

4: while not converged do 

5: w° = w; 

6: Calculate the approximate f ull g radient ViT(w) over the 

manifold according to Eqn. j22| ; 

7: for fe = 1 to kin. do 

8: Construct a mini-batch by random sampling. Denote 

the index set as I = {fci,..., fcp}. 

9: Calculate the stochastic gradient for the mini-batch, 

obtaining VPi(w'““^) = (1/p) XiLi 
10: Calculate approximate stochastic gradient for the 

mini-batch on the manifold by Eqn. ( [T^ , obtaining 
Vfti(w) = (i/p)Er=i^^fei(w); 

11: Calculate the semi-stochastic gradient accord¬ 

ing to Eqn. 1^; 

12: Solve the foDowing sub-problem: 

w* = argmin i||w - (w*"”^ - y ■ ff(w''“^))||i -|- rjR{w). 

w 2 

13: end for 

14: w 

15: end while 
16: w* w; 


Assumption 2.1. (strong convexity): We say that a function 
f : I—t M is strongly convex, if there exists /i > 0 such that 

for all u,v £ 

.f(u) > /(v) v) -h f ||u-vf, VC e 5/(v), (5) 

where 9/(v) is the sub-differential (set of sub-gradients) at 
point v. The convexity parameter is defined to be the largest 
pL that satisfies the above condition. Let P(w),i?(w) and their 
composition F{w) have non-negative convexity parameters pp, 
Pp and p respectively. It is easily verified that p > Pp -\- pp by 
definition of strong convexity and function composition. 

Assumption 2.2. (smoothness): A function f : i—t R is 

L-smooth if it is differential and there exists a smallest L > 0 
such that it satisfies 

/(u) < /(v) -t- V/(u)(u - v) + f ||u - v||2, (6) 

for all u, V S R"^. Or equivalently, its gradient is L-Lipschitz 
continuous, namely we have 

|lV/(u)-V/(v)|| <L||u-v||. (7) 

Let the Lipschitz parameter for each atomic function ipifw) be 
Li respectively. The Lipschitz parameter for their composition 
P(w) is Lp < (1/n) Tlie regularization term i?(w) is 

mostly assumed to be non-differentiable and thus has no Lipschitz 
parameter. 

2.2 Algorithmic Framework 

The composite optimization problem in Q is of broad in¬ 
terests in machine learning and data mining fields. Nonethe¬ 
less, solving it at optimal convergence speed is non-trivial. 
If we simply treat F{w) as a black-box oracle which only re¬ 
turns the first-order (sub)gradient, there are several off-the- 
shelf tools, including SGD and full (sub)gradient descent. 
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Since full (sub)gradient estimation is extremely expensive 
when huge volume of data is available, recent work has 
focused on stochastic optimization. 

SVRG 1^, as introduced in preceding section, obeys the 
update rule in Q. Procedurally it utilizes two nested loops. 
At each iteration of the outer loop, it memorizes a recent 
estimation w and calculates the full gradient Vi^(w) at w. 
In the inner loops, it calculates VFi(w^) and VFi(w) for 
mini-batches, and afterwards amends the stochastic gradi¬ 
ent VPi(w^) by the rule in (|^. Note that the same w is used 
for all updates within an outer loop. The SVRG method, 
though simple, profoundly reduces the amortized time com¬ 
plexity at iterations and theoretically achieves geometric 
convergence rate for strongly-convex smooth functions. 

Another semi-stochastic algorithm, stochastic control vari¬ 
ate (SCV) (20), represents a general approach of using control 
variate for reducing the variance of stochastic gradients. 
The update rule of SCV is similar to Q yet the last two 
(sub)gradients in Q are replaced by control variate. Data 
statistics such as low-order moments (vector mean and 
covariance matrix) can be used to form the control variate. 
The authors apply SCV to solve logistic regression and latent 
Dirichlet allocation. 

However, existing semi-stochastic methods like SVRG 
and SCV are not guaranteed to beat standard SGD in prac¬ 
tice, since computing VF(w) in SVRG or control variate 
in SCV significantly increases the iteration complexity. To 
overcome the key limitations that dramatically restrict their 
capability in large scale data analysis, we propose S3GD. 
Algorithm [^sketches the pseudo-code of S3GD. 

Before diving into algorithmic details, we want to high¬ 
light two defining traits of S3GD: 


Manifold-oriented gradient approximation: Given the 
composite function F(w), S3GD only computes the gradient 
on the smooth part P(w). For accelerating the computation 
of semi-stochastic gradient in Q, we argue that the key is to 
find a function H{w), whose design principals are: 

1) ViT(w) is a good surrogate to the full gradient of the 
smooth component VP(w), namely ViT(w) « VP(w); 

2) ViT(w) can be efficiently computed; 

3) ViT(w) = i V/ii(w) is additive, where V/ii(w) 
approximates the stochastic gradient of an atomic function. 
Namely, V/ii(w) « Vi/ii(w). 


We defer the construction of function H{w) in Sec- 


hon 


2.3 focusing on the algorithmic pipeline here. At spe- 

n} is randomly gen- 


dfic iteration, an index set 2 C{1, 
erated for constructing a mini-batch. Conditioned on current 
parameter estimation and a recent historical estimation 
w, the semi-stochastic gradient in S3GD is computed by the 
following formula: 


5i(w'',w) = Vipi{w^) - [V/ii(w) - ViT(w)], (8) 


where V/ii(w) = 

XiiGi V/i.i(w)/|Z| are the averaged original / approximate 
stochastic gradients over the index set I respectively. Here¬ 
after we use 5'i(w^) for brevity since the parameter vector 
w can be mostly inferred from the context. 

In fact, 5i(w^) provides an unbiased estimate of 
VP(w^) when I is randomly drawn from [1,..., n] with¬ 
out replacement. Its soundness is naturally fulfilled by the 


additive construction of functions P(w) and H{w). Conse¬ 
quently, the variance of 51 (w*^) becomes 

[ 5 i(w^)j = E ||V'0i(w^) — V/ix(w)|| 

- (e ||vP(w'=) - ViT(w)|P^ 

< e||vV'i(w'=)-V/ ix(w)|p. (9) 

In comparison, the variance of noisy stochastic gradient 
V'0i(w^) in standard SGD is 

Var = E ||vV’i(w'=) - VP(w'=)|p . (10) 

For Far[px(w^)] and Var[V'0i(w^)], the smaller one 
is more favorable. As shown later, to reduce Var [gx(w^)], 
we designate V/ix('w) to be a localized approximation of 
V'0x(w^). It is supposedly closer to V'0x(w^) in compari¬ 
son with the global average VP(w^), particularly when the 
input data set is with rich variety. 


Proximity-regularized linear approximation: After the 
semi-stochastic gradient gx{w^) is computed, we further 
solve the following sub-problem: 


= argmin 


P(w^) -F (px('w*^), w — w* 


j_ 

2r) 


-Fi?(w), 


( 11 ) 


where the first three terms define a proximal regularization 
of the linearized approximation of P{w) around point w^. 
i?(w) is presumably in a good shape such that solving | [TT) is 
trivial. If R{w) is itself composition of several non-smooth 
functions, one can resort to the modem proximal average 
techniques pS]]. Moreover, it is verified that Problem (IT 


can be compactly abstracted by the operation prox() below: 


pj'ox^^(u) = argmin^ ^||w - u|p-F ?7i?(w), (12) 

where u = — 77 • gxi'w^)- 


2.3 Gradient Approximation by Manifoid Propagation 

This section elaborates on a manifold-oriented method 
for approximating the stochastic gradient S/ipii'W'). Our key 
argument is that a universal gradient-approximating func¬ 
tion is either infeasible or inaccurate in general. Our pro¬ 
posed solution is anchor-based gradient approximation over 
non-linear data manifold. The idea has ever been explored 
in other context (such as feature dimension reduction) yet 
not in stochastic optimization before. 

Yu et al. showed in that any Lipschitz-continuous 
function /(x) residing on lower-dimensional manifolds can 
be approximated by a linear combination of function values, 
namely 

/W » 

where Z is a collection of pre-specified anchors. 7z(x) > 0 
is the combination coefficient depending on both the data 
vector X and anchor z. The idea is later generalized in 
the work of locally-linear support vector machine 0, 0, 
where each anchor determines a function (rather than a 
fixed value), namely /(z) in is replaced by an x-varying 
function /z(x). 
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Algorithm 2 Manifold Based Gradient Approximation 

1: Parameters: anchor number m and fe-NN parameter k. 

Anchor Selection 

2: Perform data clustering to obtain m centers Ci, i = 1... m; 
3: for i = 1 to m do 
4: Find anchor Zi by solving 

Zi = argminx ||x-Ci||^, (13) 

where x is from the training data set. 

5: end for 

Sparse Anchor-Sample Graph (ASG) Construction 
6: for i = 1 to n do 

7: For sample x;, find fe-nearest anchors ,..., ; 

8: Learn the Gaussian kernel parameter by 

a = max ( e, inf 's/Wxi - Zj\\ ) , (14) 

V J 

where e is set to be lO”'^ to avoid the trivial case a — 0. 

9: for each fc-nearest anchor z do 

10: Calculate 7 z(xi) = exp(—||xi — z||^/cr^); 

11: end for 

12: Normalize 7 z(xi) to ensure that they sum to 1; 

13: end for 

Gradient Approximation over ASG 

14: Pre-compute the product matrix XM in Eqn. (22} ; 

15: / / for V/ii(w) on a mini-batch 
16: for each Xi in the mini-batch I do 
17: Calculate V/ii(w) by Eqn. (T^ ; 

18: end for 

19: Vfe(w) = EieiVfe.(w)/|2:|; 

20: / / for approximate full gradient VL/(w) 

21: Calculate VJf(w) by Eqn. 


In Problem Q, we assume that each atomic loss function 
is linear with respect to x^. Letting the 

derivative with respect to a scalar u, the stochastic gradient 
of X; with respect to w can be factorized as below: 

\/ipi{w) = ■ Xi. (16) 

Inspired by the factorization in (T^ , we propose to 
establish a manifold over the training data, such that the 
derivative term ^'(w^X;) in (lb) can be efficiently calcu¬ 
lated via sparse information propagation on the manifold. 
Algorithm]^ shows the pseudo-code for the major steps. The 
proposed scheme consists of the following components: 

1) Constructing anchor set: Compared to universal 
gradient approximation, anchor set (13) has a stronger 
representation power by establishing local approximation 
around each anchor point. Let m be the number of anchor 
points, whose optimal value of is mostly dataset-specific. Let 
Z — {zi,..., Zm} be the anchor set. We employ a /c-means 
clustering procedure to obtain m centers in a stratified 
manner. The anchor points are chosen as the nearest samples 
to these centers, since these centers per se are not necessarily 
corresponding to meaningful features. 

2) Anchor-Sample Graph (ASG) Construction: We 

follow the local approximation scheme as described in 
Eqn. (Ts). Specifically, we propose to approximate the term 


'0'(w^Xi) in 1 16 1 by 


-!/''(w^x.,) « ^ 7z(Xi) • V''(w^z). 


(17) 


Each anchor z uniquely determines a localized function 
-(/j'jw^z), whose value varies with respect to different w. 
The coefficient 7 z(xi) controls the contribution of specific 
anchor point in computing '0'(w^x;). Geometrically, an¬ 
chors and all training samples naturally form an anchor- 
sample graph (ASG), where the connectivity strengths are 
controlled by {y^jx)}. In graph-based propagation meth¬ 
ods, it is known that connecting sample with remote anchors 
potentially does harm to the performance | |T^ . Therefore, 
each sample is enforced to only connect to its fc-nearest 
anchors. State differently, most (x) is zero so that the ASG 
is highly sparse. The computation of 7 z(x) is detailed in 
Algorithm^ 

3) Gradient Approximation over ASG: Gombining 
Eqns. (T^ and (Tt) obtains 

VV’i(w) = ip\w^x^) ■ x., 

~ (Ezez 7z(xi) • V''(w^z)) • X;, (18) 

where the right hand side in (Ts) serves as our proposed 
manifold-oriented approximate gradient. Eormally we des¬ 
ignate a surrogate function hi (w) such that 

V/ij(w) = 7z(xi) • ip'iw^zU X,. (19) 

\ze2 / 


Likewise, the approximate full gradient ViL(w) in (|^ 
can be computed by: 

ViL(w) = — V/ii(w) 

n 


1 

n 


E 



7z(X;) • 




1 

n 


E 

z^Z 


72 (x.,)-X,; 


i/;'(w^z). (20) 


Importantly, computing (T^ and (^ is highly efficient 
owing to the sparsity of ASG. It only involves executing the 
derivative function for all anchors in addition to another 
0{m + d) algorithmic operations per sample. In fact, the 
computation in can be further accelerated by per- 
computing the terms irrelevant to w. Let M € be 

the matrix by compiling all coefficients in ASG. Specifically, 
M(f, j) = 7 z^. (x;). Moreover, let 


d2(w) = • ■ ■ ,?/''(w^Z,„)^ 


( 21 ) 


be the vector of anchor derivatives conditioned on param¬ 
eter w. X = (xi,...,x„) € is the feature matrix. 

Eqn. 1 20 \ can be compactly expressed as 


ViL(w) = - • (XM) X d 2 (w), 
n 


( 22 ) 


The product XM G is not varying with respect 

to w and thus can be pre-computed for avoiding redundant 
computation at different outer loops in Algorithm 
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2.4 Instances of Applications 

This section instantiates our proposed algorithm by several 
representative loss functions and regularizations. 

Logistic Loss: It is applicable to either real or binary re¬ 
sponses. We focus on the binary case, where the label 
y = ±.1. For any data vector x, the conditional probability 
of the class label isl3 

p(?/|x; w) = CT(yw^x) := 1/(1 -F exp(- 2 /(w^x))). (23) 

The log-likelihood function is then expressed as T’(w) = 
Er=i^(w^Xi) = E"=i logp(y*|xi; w). According to the 
calculus rule of sigmoid function, the gradient of '!/'(w^Xi) 
is: 

V'!/'(w^Xi) = cr(-y,w^x,) -y.Xi (24) 

'-V-^ 

derivative V"( W^Xi) 


Another solution of smoothing hinge loss is using 
squared hinge loss as adopted by L2-SVM |3|, namely 
(1/2)((1-2 /wTx) + )2, which naturally removes the irreg¬ 
ular point at the risk of over-penalizing large response. Its 
gradient at a sample (x^, yi) is 

V'!/’(w^Xi) = (l-2/jW^Xi)+ -(-j/iXi). (27) 

'-V-' 

derivative pq 

Regularization: The regularization function i?(w) can be 
either smooth {e.g., Tikhonov regularization) or non-smooth 
(e.g., 1-norm regularization). Below we list a few regulariza¬ 
tion functions widely-used in machine learning: 

(Tikhonov) : i?(w) = A||w|| 2 . 

(1-norm) : i?(w) = A||w||i. 

(Elastic net) : i?(w) = A(1 — a)||w||i -F Aq;||w|| 2 . 


Applying the idea of anchor-based approximation in 
Eqn. |T^ , we have 

V'!/’(w^Xi) = (T(-j/iW^Xi) • y^Xj 



which indicates that the approximate stochastic gradient is 
V/ii(w) = I ^ 7z(xi) • cr(-yiw^z) I • yiX,;. (25) 

\zG2 / 

The label and feature vector z are tightly coupled 
in Eqn. pS}, which prohibits the matrix multiplication in 
Eqn. l|22|. To decouple them, the stochastic gradients of 
different samples can be handled according to their labels. 
More formally, let us consider the following two cases: 

Case-V. yi = 1. We have V'(/'(w^Xi) = tT(—w^x^) • 
Xi, where (t(— w^x^) can be efficiently approximated by 
Ez7z(x*) • cr(-wTz). 

Case-2-, yi = —1. Now there is V'0(w^Xi) = cr(w^Xi) • 
(-Xi) = (1 - CT(-wTxi)) • (-Xj) = -Xi -F (T(-W^Xi) • Xi. 
Note that we use the property of sigmoid function a{u) = 
1 — a{—u). It turns out that we can still apply the tricks 
in Case-1 by amending the result with an additional term 
—Xi. In practice, to estimate ViL(w), we can pre-compute 
(l/n)Ei ■ y-=-i^i and use it to compensate the quantity 
calculated from Eqn. ^22) . This way the computation still 
enjoys the high efficacy of matrix-based operations. 

Hinge Loss and Squared Hinge Loss: The loss function 
popularized by SVM is known to be hinge loss (1—y x) +. 
It is non-differentiable due to the irregularity at yw^x = 
1. However, as discovered in hinge loss can be 

smoothed by the loss of “modified logistic regression": 

(1 - yw^x)+ » i log (^1 -F exp(-/3(yw^x - 1))) . (26) 

The approximation residual asymptotically becomes 
zero when t -Foo, therefore we can cast hinge loss into 
the the framework of logistic loss with properly-chosen /3. 

3. For the sake of simplifying notations, we remove the inter¬ 
cept variable by appending an additional dimension of constant 
1 to any feature vector x. 


When parameters w constitute a matrix rather than a 
vector, regularization terms such as matrix nuclear norm Q 
can be applied. However, optimizing with all above reg¬ 
ularization under the proximal operator in (12 has been 
maturely developed. We thus omit more discussion. 


2.5 Algorithmic Complexity 

The iteration complexity of the proposed S3GD depends 
on several parameters: the mini-batch size p, the number 
of anchors m, the Ic-NN parameter in constructing ASG, 
the maximal inner loop count kin and the feature dimen¬ 
sionality d. In the pre-computation, obtaining the matrix 
product XM requires 0{dkn) time complexity. And finding 
the k nearest anchors has a time complexity of 0{nmd). 
At the run-time, the complexity of computing ViL(w) is 
comprised of 0{dm) for evaluating anchor gradients in 
Eqn. (21 > and 0{dm) for multiplying the pre-computed XM 
with a vector dz(yf)- Note that WH{'w) is only calculated 
once at the beginning of each outer loop in Algorithm 
Gomputing V'!/;x(w^) and V/ix(w) in Eqn. ([^ admits a time 
complexity of 0{pd) or O{p{k d)) respectively. 

Most of existing semi-stochastic algorithms rely on two 
nested loop, of which the outer loop incurs exact full gradi¬ 
ent computation or covariance matrix estimation. Eor large 
data, it entails a tremendous 0{nd) or 0{d^) complexity. 
For other sophisticated algorithms that target at improved 
mini-batch construction (such as SSGD p7|), the iteration 
complexity is generally better than ours. However, the lack 
of global information makes these algorithms more sensitive 
to noise in stochastic gradients. 

Regarding the space requirement, the major costs for 
S3GD include 0{dm) for storing the product matrix in 
Eqn. and 0{nk) for recording k nearest anchors for each 
sample. Akin to SVRG and SCV, S3GD does not memorize 
historic gradients. We summarize the space and time com¬ 
plexities for all interested algorithms in Table 


3 Convergence Analysis 

We need two lemmas as below to advance the convergence 
analysis. The first lemma states 

Lemma 3.1. Consider the composite function F{w) as defined 
in Eqn. (|^. It satisfies both Assumptions 2.1 and 2.2. Let w* = 
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SGD SSGD 

SVRG 

scv 

S3GD 

Iteration Time Complexity: 

0(p X d) 0{p X d) 

C> (p X d-F ^ X d) 

O (p X d-Fd^) 

O (p X d-F X d-F p X (A;-F d)) 

Preprocessing Time Complexity: 

- 0{n X m X d) 

- 

- 

0{n X m X d) 

Space Complexity: 

Oid) 0(dm.)-{-0{n) 

0(d) 

0{dffi~ 

0{dm) + 0{nk) 


TABLE 1 

Time/space iteration compiexity for aii algorithms involved in the quantitative study, p, m, n, d denote the size of a mini-batch, the 
number of anchors in S3GD (or the number of clusters in SSGD), the number of training samples, and the feature dimensionality 
respectively, k denotes the anchor fc-NN parameter. Note that for SVRG and SSGD, they both adopt nested loop during the 
optimization, fe™ denotes the maximal iteration count of the inner loop. The mark implies the absence of any pre-processing. 


arg minw F{w) be the optimal solution and L = maxi Li be the 
maximal Lipschitz-continuous constant of i = 1,... ,n. 

Then we have 


1 ''' 

- ^ II V^,(w) - < 2L [F(w) - F(w*)]. (28) 


The key tricks in the proof of Lemma 3.1 were originally 
developed in (9) and a complete proof is found in Lemma 
3.4 of pT| . The other lemma is an extension of above lemma 
with anchor-based gradient approximation. 


Lemma 3.2. Consider the same problem setting as in Lemma [3T] 
The gradient of ft ilyf) is approximated by VLi(w) according to 
Eqn. p^ . Then 


1 

||V/i,(w) - 

< 2L{F{w) — F(w*)) + 4Lmax |iL(w) — T’(w)|, 

W 


where H{w) = ^ h,{w). 


Proof Let us construct an auxiliary function as below, 

a^{w) = hi{xv) - V'iCw*) - VV'i(w*)^(w - w*), (29) 


where i G {1,... ,n}. It is easily verified that Vai(w) = 
Whi{w) — Wipifw*). From the construction of hi{'w), ai(w) 
is Lipschitz continuous with constant L = max^ Li. Based 
on an argument in Theorem 2.1.5 in p5) , we have 


— ||Vai(w)|p < aj(w) - minaj(w). 
ZL/ w 

Averaging over i = 1,... ,n, we obtain 
1 " 


n 

1—1 


< 2L 


^ / L IL 

- Ui (w)-'V min Oi (w) 

n n w 

2 = 1 2 = 1 


(30) 


(31) 


From the definition of ai{w), we have 


1 . 

— adw) = iL(w) — Pfw*) — VP(w*)^(w — w*). 
n 

2 = 1 


By the optimality of w*, a sub-gradient G dJi(w*) 
exists such that VF(w*) + ^* = 0. Therefore 

1 ” 

-y]a,(w) 

2=1 

= IT(w) — F(w*) + (^*)^(w — w*) 

< H{w) — P(w*) -F R{w) — R{w*) 

= iL(w) — P(w)-F F(w) — F(w*) 

< max |iL(w) - P(w)| + P(w) - P(w*), (32) 


where the first inequality follows from the definition of sub¬ 
gradient. 

The second term on the right hand side of ( [^ can be 
bounded as below: 

1 ” 

— y^ minadw) 
n —‘ w 
2=1 

1 "" 

> min — > CLi{'w) 

w n 

2=1 

1 ^ 

= min - V \hi{w) - ipi{w*) - VV'i(w*)^(w - w*)j 
w n ^^ L J 

2=1 

1 "" 

> min - V [L,(w) - '(/’^(w)] (33) 

w n 

2=1 

= min [iL(w) — P(w)] 

> — max |P(w) — P(w)|. (34) 

Combining and pi) completes the proof. 

□ 


The following is our main observation regarding the 
convergence property of the proposed S3GD: 

Theorem 3.1. For compositional function P(w) = P(w) + 
P(w), assume its two components P(w), P(w) have strong 
convexity parameter pLp > Q,pr > 0 and pp ■ pp > 0. ■fi{'w) 
has Lipschitz parameter Li, i = 1,... ,n. Let L = max^ Li be 
the maximal Lipschitz-continuous constant of-fifw). Set the step 
size rj G (0, and the iteration count in the inner loop (denote 
it as m) sufficiently large, such that 


_ 1 
^ {Pp + Pr)v{^ -'^Lr])m 


ALri{m -F 1) ^ 

(1 — TLrj)m 


(35) 


and 


A = 


SffiLlpp -F PR)m 


ffipp -F PR){m - Ar]L{2m -F 1)) - 1 


• ^H.p > 0, (36) 


where Ap.p = maxw|Lf(w) — P(w)| denotes the maximal 
residual between the smooth component P(w) and its anchor- 
based approximation H{w). 

Moreover, use s to index the outer loops. The proposed S3GD 
algorithm will satisfy the following inequality. 


P(w«) - P(w*) -A<p^ [P(wo) - P(w*) - A], (37) 

where p and A are defined in Lqns. p^p6) respectively. 

Proof. The proof is essentially an adaption of Theorem 3.1 
in pT) . Let us consider a single outer loop (indexed by 
s) which consist of m inner iterations in total. Use to 
denote the parameter vector at the fc-th inner iteration. w° 














is initialized as Wg obtained in previous outer loop. Without 
loss of generality, let us consider mini-batches with single 
random sample at each inner iteration. Based on Lemma 3.7 
in 1 ^, we have 

< ||w^“^ — W*|p — 77/ip||w^“^ — W*|p 
- 77 /rp||w'' - w*||^ - 2r]{F{yv^) - i^(w*)) 
- 2 ??A^(w'= - w*) 

< ||w^“^ — w*|p — 277 (F(w^) — F(w*)) 

-2??A^(w'= - w*), (38) 

where we use the notations A^ = — VP(w^“^) and 

Vfe = - [V/iiJwg) - ViL(wg)]. (39) 

To further bound ( [38) , we have to investigate the term 
Afc (w^ — w*). To this end, let us define the proximal full 
gradient update as 

w'= = prox^p(w'=-i - r]VP{w^-^)). (40) 

An argument in Theorem 3.1 of pTj indicates that the 
relation below holds: 

- Afc (w'^ - w*) < r?||Afef - Aj(w'= - w*). (41) 

Applying ( [41) to l |38| obtains 

< ||w'=-i-w*f- 2 ??(F(w'=)-J^(w*)) 

+2772||Afef-2r;A^(w'=-w*), (42) 

Now we further take expectation with respect to the 
random variable ik in l |^ . Since both w^, w* are irrelevant 
to the choice of ik, there is EA^(w^ — w*) = (EA^)^ (w^ — 
w*) = 0. The term || Afcp can be bounded as below: 

E||A,f 

= E|| Vr/:., (w'^-i) - [V/i., (wg) - ViT(wg)] 

-VP(w'=-i)f 

= E||vV- 4 (w'=-i)-V/i,,(wg)||' 

- ||viT(wg) - VP(w'=-i)|p 

< E ||V'i/’ife(w''“^) - V/iiJwg)|| 

< 2 e||v^,,(w'=- 1 )-VV',,(w*)||' 

+ 2 E||Vh,,(Wg)-VV',,(w*)f 

< 4L(F(w'=-i)-F(w*) + F(Wg)-P(w*)) 
-l- 8 LAp p, 

where we introduce the notation Ap p = max^ |iT(w) — 
P(w)| for the brevity of the presentation. 

Therefore l |4^ can be further transformed as 

E||w^ — w*||^ 

< ||w^“^ — w*||^ — 2 r 7 (EP(w^') — P(w*)) 

-h 8 ? 7 ^P(P(w'=-i) - P(w*) -h P(wg) - P(w*)) 
+16r]^LAH.,p. (43) 


After all m iterations have been completed, we set 
Wg+i = „^minP(w''). Summing (43' over 

k = 1,... ,m, we obtain 


-^[nw.)-P(w*)] 

< E||w"* — w*|p — ||w° — w*|p 

m 

< EP(w'=) - P(w*)) 

k^l 

m 

(eP(w'=-1) - F{w*)J 

k^l 

Lm{F {w s) — F(w*) + 16r]^LmAH,p 

m 

< -27/(1 - 4r;P) ^ (^EP(w'=)-P(w*)) 

k=l 

+8t]^L (P(w°) - P(w*) - (EP(w'") - P(w*))) 
+8rj‘^Lm{F{ws) — P(w*) -|- lQrfLmAH,p 

< -27/(1 - 47/P)to (P(Wg+l) - P(w*)) 

+8rf‘L{m + l)(P(Wg) — P(w*) 

-|-167/^Prr7Ap_p, (44) 


where the first inequality follows from the strong convexity 
of P(w) and the last inequality is derived from the construc¬ 
tion of Wgpi. 

Re-arranging ( |44) and using the notations p, A in 1 35 i 36 
respectively, we obtain 

P(wg+i) - P(w*) - A< p [P(wg) - P(w*) - A]. (45) 


35'136 


Recursively expanding the right hand side of above 
inequality obtains the main theoretic result in (|37|. □ 


Remarks: Similar to the original SVRG and its variant 
Prox-SVRG, our proposed S3GD also enjoys an exponential 
rate of convergence. The convergence guarantee S3GD is 
comparably weaker, since it is not guaranteed to converge 
to the exact globally-optimal solutions. Instead, the theoretic 
analysis in Theorem |3 .1 [ essentially states that, when the step 
size 7/ is sufficiently small, the function value-descending 
process from P(wo) to P(w*) + A admits a geometric rate. 
However, the optimization after reaching P(w*) + A is 
beyond the scope of Theorem 3.1 which we will investigate 


through empirical study in the experimental section. In a 
word, our proposed S3GD avoids the intensive computation 
of full gradient by relaxing the optimality guarantees. 

The quantity A is primarily determined by the residual 
between the smooth component P(w) of the composite ob¬ 
jective function and its anchor-based approximation P(w). 
Note that in Eqn. ( [3^ , when rj is proportional to 1/L and m 
is sufficiently large, the coefficient in front of Ah,p tends 
to be a constant, rather than increasing with respect to 
larger m. Moreover, the maximum value of the step size 
7/ is slightly lower than that in Prox-SVRG {rj < 1/{8L) in 
S3GD and 7/ < 1/(4L) in Prox-SVRG). 

To mitigate the effect of A, one can either reduce the 
value of 7/ or switch to the normal Prox-SVRG procedure 
when the estimated parameter vector is close to the global 
optimum. In this paper we leave such investigation to the 
future work, faithfully reporting the performance of S3GD 
using constant step sizes in all experiments. 
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Fig. 2. Investigation of the effect of step sizes on the convergence speed and soiution stabiiity. We seiect CIFAR10 as the testbed 
and report the training objective vaiues under four different step size parameters, it is seen that iarge step sizes often indicate faster 
convergence yet aiso bring the risk of bouncing around the optimum. Variance reduction is thus criticai for using iarge step sizes. 
Note that the objective vaiues are piotted in iogarithmic scaie. Better viewing when eniarged and in coior mode. 


Dataset 

Train/Test Size 

#Dim 

#Class 

ZOnewsgroups 

11,314 / 7,532 

26,214 

20 

IJCNN 

49,990 / 91,701 

22 

2 

WebSpam 

280,000 / 70,000 

254 

2 

CIFARIO 

50,000 / 10,000 

800 

10 

Kaggle-Face 

315,799 / 7,200 

256 

7 

MEDll 

30,000 / 16,904 

5,000 

25 

KDD04 bio 

120,000 / 25,751 

74 

2 

KDD04 phy 

45,000 / 5,000 

65 

2 

Covtype 

500,000 / 81,012 

54 

2 


TAB Lb 2 

Summary of the benchmarks used in the experiments. #Dim 
and #Class represent the number of feature dimensions and 
unique categories respectiveiy. 

4 Experiments 

This section reports the numerical studies between our 
proposed S3GD and other competing algorithms. 

4.1 Description of Dataset and Appiications 

To make the experiments comprehensive, we include nine 
benchmarks that cover a variety of heterogeneous tasks 
and different data scales: 20-Nezvsgrotip^ which contains 
nicely-organized documents from 20 different news topics, 
WebSparudrepresents a large collection of annotated spam or 
non-spam hosts labeled by a group of volunteers, I/CNA|^ 
for time-series data, KDD04Jjio and KDD04_ph^ which 
correspond to the protein homology sub-task and quantum 
physics sub-task in KDD-Cup 2004 respectively, coufyp^ 
which includes cartographic variables for predicting forest 
cover type. We also include three computer vision bench¬ 
marks: CIFARK^foi image categorization, Kaggle-Fac ^^ for 
facial expression recognition and MEDllp^for video event 
detection. 

Table summarizes the critical information for above- 
mentioned benchmarks. For most datasets, we adopt the 

4. http://qwone.com/~jason/20NewsKroups/ 

5. http://www.csie.ntu.edu.tw/~cjlm/libsvmtools/datasets/binary. 
html 

6. http://www.geocities.com/ijcnn/rmcJjcnnOl.pdf 

7. http://osmot.cs.corneU.edu/kddcup/datasets.html 

8. https://archive.ics.uci.edu/ml/datasets/Covertype 

9. http://www.cs.toronto.edu/~kriz/cifar.html 

10. https://www.kaggle.com/c/challenges-in-representation- 
learning-facial-expression-recognition-challenge 

11. http://www.nist.gov/ itl/iad/mig/medll.cfm 


defaulted train/test data split. Regarding the features, we 
either use the feature files provided by the benchmark 
organizers or extract them by ourselves. They may not 
necessarily bring state-of-the-art accuracy since our focus 
is investigating the convergence speed of the optimization 
methods instead of just driving for higher performance. The 
defaulted tasks defined on some benchmarks are multi-class 
classification. In these cases, a one-vs-rest scheme is applied 
to simplify the evaluations. We pick the category with the 
most training samples as the positive class and merge all rest 
categories as the negative class, converting it into a binary 
classification problem. Whenever the positive/negative data 
partitions are heavily unbalanced, we assign samples from 
positive/negative classes different weights such that the 
weight summarizations of the two classes are equal. More 
formally, let be the index sets of positive/negative 

classes respectively. The loss is calculated as 

T,iey+ V'i(w) -F 13^ Eiey. V'i(w). (46) 

In all experiments we stick to using the logistic loss func¬ 
tion and Tikhonov regularization owing to their empirical 
popularity and non-linear property. 

4.2 Baseline Algorithms 

We make comparisons between the proposed S3GD and 
other four competitors, including 

• Mini-Batch Stochastic Gradient Descent (SGD): it 

represents the standard stochastic gradient method. 
At each iteration, the SGD algorithm randomly 
draws p samples from the training set according to 
weight distribution specified in Eqn. l |46| , calculate 
their respective stochastic gradient, and uniformly 
average these stochastic gradients. 

• Stratified SGD (SSGD) p7\ : This method aims to 
improve the standard mini-batch SGD using data 
clustering and stratified sampling. SSGD ensures 
that each iteration will draw at least one sample 
from each data cluster (stratum). The inclusion is to 
contrast different ways of using global information 
about data. For fair comparison we set the number 
of clusters to be p. 

• Stochastic Variance Reduction Gradient (SVRG): 

This original idea work of SVRG is found in Q. 
ITowever, it does not handle non-smooth functions. 
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0.001 


Fig. 3. Iteration time complexities in terms of CPU times on all datasets. The time is recorded in seconds. To highlight subtle 
difference, we adopt logarithmic scale for the vertical axis. The length of time is visualized by a bar that points to its due value. See 
text for more explanations. 


In the comparison we adopt the extension proposed 
in (^ . Inheriting the two nested loops of SVRG, 
one of the key parameters in is the the maxi¬ 
mal iteration number in the inner loop. The authors 
suggest this parameter shall be sufficiently large for 
achieving better loss bound. We fix this parameter to 
be always 50 in all experiments, which empirically 
provides a good balance between convergence speed 
and heavy complexity caused by exact gradient esti¬ 
mation. 

• Stochastic Control Variate (SCV) |@: This is an¬ 
other semi-stochastic gradient method that reports 
state-of-the-art speed and accuracies. The method 
relies on the utilization of data statistics such as 
low-order moments to define "control variate". The 
authors rigourously prove the reduction in noisy 
gradient variance under mild assumptions. Note that 
for features in high dimension, the computation of 
data statistics can be its computational bottleneck. 

4.3 Evaluation Settings 

For all experiments, we fix the parameter A = 10“^ for the 
Tikhonov regularization. The maximal iteration parameter 
kin in the inner loop of S3GD is fixed to be 20. Each mini¬ 
batch contains p = 10 random samples. For S3GD, m = 100 
anchors are generated on all datasets. We implemented all 
baseline algorithms and S3GD in optimized C++ programs. 
The experiments are conducted on shared servers in an in¬ 
dustrial research lab. Each of the servers is equipped with 48 
CPU cores and 400GB physical memory. Five independent 
trials are performed for all algorithms and the averaged 
results are reported. The entire experiments take about one 
day on five servers. 

There are two indices which are utterly crucial for evalu¬ 
ating a gradient based optimization scheme: the correlation 
(or variance) between (semi)stochastic gradient and exact 
gradient, and the maximal step size which ensures the 
stability of the optimization. In the literature of stochastic 
gradient methods, both decayed and constant step sizes 
are widely adopted. We find that tuning the decayed step 
sizes is very tricky, which makes a fair comparison among 
different algorithms difficult. Therefore we focus on the 
results using constant step sizes. 

Large step sizes are always favored in practice since they 
expect improved convergence speed. To see this point, in 


Figure we plot the objective values in each iteration of 
the training stage on CIFARIO. For all baseline algorithms 
and our proposed S3GD, the convergence curves under four 
different constant step sizes r] = {0.1,1, 5,10} are recorded 
and plotted. Obviously SVRG and S3GD are two most stable 
algorithms even operating with large step size parameters. 
All other three algorithms drastically fluctuate when their 
current solutions approach the global optimum, even with 
the moderate parameter r] = 5. This empirical investigation 
highlights the importance of choosing proper step size. 

To fairly compare different algorithms, we evaluate them 
under the parameter set r] G {0.1,1,5,10} and report the 
performance with the largest step size that satisfies the 
following stability condition: 

? 7 * = max 77 s.t. F{w,r]) < (1 -F e)F"(w*), (47) 

where F{w*) denotes the objective value at the global 
optimum w*. F(w; 77 ) is the converged point using step size 
77 . In the experiments we average the last 5,000 optimization 
iterations to obtain F(w; 77 ). e is set to be 0.01 in all cases. 
The condition aims to abandon any step size parameter 
that drives the solution crazily bounce around the global 
optimum. 

Most of prior works pj, report the performance with 
respect to iteration counts. We here argue that the evaluation 
shall take the iteration complexity into account. Recall that 
Table summarizes the time and space iteration complex¬ 
ities for all algorithms. Importantly, the complexities of 
SVRG and SCV are dominated by the exact gradient compu¬ 
tation and class-specific covariance matrix estimation. Both 
of them are expected to take longer time for accomplishing 
each iteration. Figure reports the time for performing 50 
gradient descent iterations for all datasets and algorithms. 
The computing time is obtained by averaging all trials. 
It is observed that on most datasets, the standard SGD 
consumes the least time. SSGD and our proposed S3GD 
use slightly more time compared to SGD. The CPU time of 
SVRG and SCV are significantly larger. Specifically, SVRG 
is especially slow in comparison when facing large scale 
data set (such as Kaggle-face and covtype) and high feature 
dimensions {e.g., 5,000-dimensional features for MEDll and 
26,214-dimensional features for 20newsgroups). Likewise, 
SCV is particularly slow when handling high-dimensional 
features. On the 20newsgroups, SCV requires 594 seconds 
for every 50 iterations, which is beyond the scope of most 
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kaggleface 



Computing Time (in logarithmic scale) 


cifarlO 



Computing Time (in logarithmic scale) 


webspam 



Computing Time (in logarithmic scale) 



covtype 


Computing Time (in logarithmic scale) 



kdd04_bio 



Computing Time (in logarithmic scale) 



Fig. 4. Training objective values for all referred algorithms in this paper on 9 datasets. The horizontal axis correspond to CPU times 
(in seconds) in the logarithmic scale. Note that we do not report the performance of SCV on 20newsgroups since it takes five days 
for accomplishing all 40,000 iterations. Better viewing in color. 


practitioners. In contrast, SGD and S3GD only need 0.21 and 
0.30 seconds respectively. Therefore for fairness in compari¬ 
son, we will majorly concern the performance with respect 
to CPU times. 


4.4 Quantitative Investigations 

Convergence Speed: Figure shows the training objective 
values with respect to the CPU times. For all algorithms, the 
step-size parameters are chosen according to the criterion 
in l |47| . Interestingly, though semi-stochastic gradient meth¬ 
ods are proved to enjoy faster asymptotical convergence 
speed, most of them are not as “economic" as standard 
SGD due to significantly higher iteration complexity. Our 
proposed S3GD exceptionally outperforms all other algo¬ 
rithms on 6 out of 9 datasets. SVRG only dominates the 
small-scale 22-dimensional dataset IJCNN, and SGD yields 
the best performance on other two datasets KDD04_bio and 
20newsgroups. SSGD is found to be sensitive about imbal¬ 
anced data partition, such as MEDll and 20newsgroups, 
where the positive/negative data ratios are 1:25 and 1:20 
respectively. 


It is also important to investigate the generalization abil¬ 
ity of the learned machine learning models. To this end, we 
also periodically record the objective values on the testing 
set. The results are displayed in Fig. from which it is easy 
to visually rule out the possibility that the generalization 
error and training objective value are inconsistent. 

It is surprising that the standard SGD is among the best 
performers on nearly all of 9 datasets despite its simplicity. 
Based on the experiments we argue that the research of 
semi-stochastic algorithms shall investigate the balance of 
larger step size and increased iteration complexity, particu- 
lary in the era of large data and high dimension. 

Corrleation of Gradients: We further study the Pearson 
correlation of (semi)stochastic gradient and the exact gra¬ 
dient. For a semi-stochastic algorithm, the correlation score 
is favored to approach the value of 1, since it indicates a 
better approximation scheme for gradient computation. 

It is clearly observed that SVRG and the proposed 
S3GD exhibit the most favorable correlation scores. More¬ 
over, most methods enjoy relatively larger correlation scores 
when the optimization just begins. The correlation scores 
gradually drop when the optimization proceeds. The reason 
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Fig. 5. Testing objective values for all referred algorithms in this paper on 9 datasets. 


may be that the exact (sub)gradient tends to zero around the 
optimum, which makes accurate gradient approximation 
more challenging. The only exception is SVRG. In all cases 
its correlation scores quickly rise and stay at 1. It may be 
caused by the fact that |jw^'+^ — w^|| tends to zero when 
approaching the global optimum. Therefore, w « in 
Eqn. Q, which implies that the semi-stochastic gradient 
becomes increasingly close to the full gradient. 

Effect of Anchor: Recall that we use 100 anchors obtained 
through clustering in all experiments. One may concern how 
different choices of anchor number affect the performance 
and running time. Figure presents the evolution of cor¬ 
relations scores under different anchor settings on MEDll 
and CIFARIO (step size is fixed to be 1 for all cases). 

Interestingly, we observe that enlarging anchor set does 
not entail boosted correlation scores. In fact, the scores 
will reach its peak around data-specific anchor number 
(100 for MEDll and 20 for CIFARIO) though other choices 
bring alike performances. This implies that the algorithm is 
largely robust to the anchor number though empirical tun¬ 
ing does further help. Figure]^ plots the averaged iteration 
times for different anchor parameters. More anchors entail 
longer CPU time. Yet the time only increases sub-linearly 
owing to other kinds of computational overhead at each 


iteration. 

Effect of inner iteration count kin- Both SVRC and our 
proposed S3CD adopt two-level loops for optimizing the 
objective function. The parameter kin refers to the maximal 
iteration count in the inner loops. Note that we use different 
kin for SVRC and S3CD (50 for SVRC and 20 for S3CD). 
Intuitively, smaller kin accelerates the convergence speed for 
both S3GD and SVRG with respect to the iterations, since it 
indicates more frequent update of the historical parameter 
vector at the outer loop. However, as shown in Table 
SVRG requires the computation of exact gradient at the 
beginning of each outer loop, which implies a complexity of 
0{nd). In contrast, S3GD only has a complexity of 0[dm) 
using the pre-computation trick as in Eqn. ^2\ . Therefore, to 
balance the amortized iteration complexity and convergence 
speed, SVRG tends to prefer larger kin than S3GD. 

To validate this point, Fig.j^plots the objective values on 
the testing data of the CIFARIO benchmark under km = 20 
or kin = 50. Both S3GD and SVRG exhibit similar con¬ 
vergence curves under different choices of km- Intuitively, 
though large km implies more frequent historical parameter 
update, the significantly increased iteration complexity of 
SVRG actually slows down its convergence. 
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Fig. 6 . Pearson correlation scores of (semi)stochastic gradient and the exact gradient on 9 datasets under the parameter 77 = 0.1. 
See text for more explanation. Best viewing in color. 
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Fig. 7. Investigation of how the anchors affect gradient correlation 
scores on MED11 and CIFARIO. 


5 Concluding Remarks 
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Fig. 8 . Average CPU time for every 50 optimization iterations (in 
seconds) for S3GD. 

the high iteration complexity in existing semi-stochastic 
algorithms. It exploits stratified manifold-based gradient 
approximation as a good cure for the time-consuming exact 
gradient calculation. Our work significantly advances the 
original idea of residual-minimizing gradient correction. 
The current paper did not discuss the application in a dis¬ 
tributed computing environment, since it is out of the main 
scope. However, we will explore the distributed variants of 
the proposed S3GD like [ (T8| in the future. Moreover, exten¬ 
sion to non-convex formulations such as deep networks ||^ 
is also a meaningful future direction. 
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